{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(categorical-data)=\n",
    "# Categorical Data\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In this chapter, we'll introduce how to work with categorical variables—that is, variables that have a fixed and known set of possible values. This chapter is enormously indebted to the **pandas** [documentation](https://pandas.pydata.org/).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a55374",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove cell\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib_inline.backend_inline\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
    "matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17575f3a",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "This chapter will use the **pandas** data analysis package."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8757f369",
   "metadata": {},
   "source": [
    "## The Category Datatype\n",
    "\n",
    "Everything in Python has a type, even the data in **pandas** data frame columns. While you may be more familiar with numbers and even strings, there is also a special data type for categorical data called `Categorical`. There are some benefits to using categorical variables (where appropriate):\n",
    "\n",
    "- they can keep track even when elements of the category isn't present, which can sometimes be as interesting as when they are (imagine you find no-one from a particular school goes to university)\n",
    "- they can use vastly less of your computer's memory than encoding the same information in other ways\n",
    "- they can be used efficiently with modelling packages, where they will be recognised as potential 'dummy variables', or with plotting packages, which will treat them as discrete values\n",
    "- you can order them (for example, \"neutral\", \"agree\", \"strongly agree\")\n",
    "\n",
    "All values of categorical data for a **pandas** column are either in the given categories or take the value `np.nan`. \n",
    "\n",
    "## Creating Categorical Data\n",
    "\n",
    "Let's create a categorical column of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "535ef959",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame({\"A\": [\"a\", \"b\", \"c\", \"a\"]})\n",
    "\n",
    "df[\"A\"] = df[\"A\"].astype(\"category\")\n",
    "df[\"A\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62db4dec",
   "metadata": {},
   "source": [
    "Notice that we get some additional information at the bottom of the shown series: we get told that not only is this a categorical column type, but it has three values 'a', 'b', and 'c'.\n",
    "\n",
    "You can also use special functions, such as `pd.cut()`, to groups data into discrete bins. Here's an example where specify the labels for the categories directly:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "358c83bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"value\": np.random.randint(0, 100, 20)})\n",
    "labels = [f\"{i} - {i+9}\" for i in range(0, 100, 10)]\n",
    "df[\"group\"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87bc56b6",
   "metadata": {},
   "source": [
    "In the example above, the `group` column is of categorical type.\n",
    "\n",
    "Another way to create a categorical variable is directly using the `pd.Categorical()` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb389105",
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_cat = pd.Categorical(\n",
    "    [\"a\", \"b\", \"c\", \"a\", \"d\", \"a\", \"c\"], categories=[\"b\", \"c\", \"d\"]\n",
    ")\n",
    "raw_cat"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebfdf815",
   "metadata": {},
   "source": [
    "We can then enter this into a data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0497fc16",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(raw_cat, columns=[\"cat_type\"])\n",
    "df[\"cat_type\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76dc9dd6",
   "metadata": {},
   "source": [
    "Note that NaNs appear for any value that *isn't* in the categories we specified—you can find more on this in {ref}`missing-values`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa76a795",
   "metadata": {},
   "source": [
    "You can also create ordered categories:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7520d3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "ordered_cat = pd.Categorical(\n",
    "    [\"a\", \"b\", \"c\", \"a\", \"d\", \"a\", \"c\"],\n",
    "    categories=[\"a\", \"b\", \"c\", \"d\"],\n",
    "    ordered=True,\n",
    ")\n",
    "ordered_cat"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "258cc6f5",
   "metadata": {},
   "source": [
    "## Working with Categories\n",
    "\n",
    "Categorical data has a `categories` and a `ordered` property; these list the possible values and whether the ordering matters or not respectively. These properties are exposed as `.cat.categories` and `.cat.ordered`. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.\n",
    "\n",
    "Let's see some examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2caba354",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"cat_type\"].cat.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f1fe093",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"cat_type\"].cat.ordered"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "694c07b0",
   "metadata": {},
   "source": [
    "If categorical data is ordered (ie `.cat.ordered == True`), then the order of the categories has a meaning and certain operations are possible: you can sort values (with `.sort_values`), and apply `.min` and `.max`."
   ]
  },
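  {
   "cell_type": "markdown",
   "id": "ordered-ops-note",
   "metadata": {},
   "source": [
    "As a minimal sketch of these operations, the cell below builds a small ordered categorical series, sorts it, and takes its minimum and maximum. The `survey` variable and its values are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ordered-ops-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical survey responses, ordered from least to most agreement\n",
    "survey = pd.Series(\n",
    "    pd.Categorical(\n",
    "        [\"agree\", \"neutral\", \"strongly agree\", \"neutral\"],\n",
    "        categories=[\"neutral\", \"agree\", \"strongly agree\"],\n",
    "        ordered=True,\n",
    "    )\n",
    ")\n",
    "# Sorting follows the order of the categories, not alphabetical order\n",
    "print(survey.sort_values().tolist())\n",
    "# Smallest and largest values according to the category order\n",
    "print(survey.min(), survey.max())"
   ]
  },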
  {
   "cell_type": "markdown",
   "id": "0a6d7a90",
   "metadata": {},
   "source": [
    "### Renaming Categories\n",
    "\n",
    "Renaming categories is done via the `rename_categories()` method (which works with a list or a dictionary)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "097171b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"cat_type\"] = df[\"cat_type\"].cat.rename_categories([\"alpha\", \"beta\", \"gamma\"])"
   ]
  },
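  {
   "cell_type": "markdown",
   "id": "rename-dict-note",
   "metadata": {},
   "source": [
    "Because `rename_categories()` also accepts a dictionary, you can rename just some of the categories and leave the rest untouched. The sketch below uses a made-up series, `letters`, rather than the data frame above, so it can be run on its own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rename-dict-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical example series with categories \"x\", \"y\", and \"z\"\n",
    "letters = pd.Series([\"x\", \"y\", \"z\", \"x\"], dtype=\"category\")\n",
    "# Keys are existing category names and values are the new names;\n",
    "# categories missing from the dictionary (\"z\" here) keep their names\n",
    "letters.cat.rename_categories({\"x\": \"first\", \"y\": \"second\"})"
   ]
  },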
  {
   "cell_type": "markdown",
   "id": "3de7fd23",
   "metadata": {},
   "source": [
    "Quite often, you'll run into a situation where you want to add a category. You can do this with `.add_categories()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ae6df38",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"cat_type\"] = df[\"cat_type\"].cat.add_categories([\"delta\"])\n",
    "df[\"cat_type\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba525491",
   "metadata": {},
   "source": [
    "Similarly, there is a `.remove_categories()` function and a `.remove_unused_categories()` function. `.set_categories` adds and removes categories in one fell swoop. One of the nice properties of set categories is that  Remember that you need to do `df[\"columnname\"].cat` before calling any cat(egory) functions though."
   ]
  },
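  {
   "cell_type": "markdown",
   "id": "remove-set-note",
   "metadata": {},
   "source": [
    "As a rough illustration of these three methods, the cell below uses a made-up series, `sizes`, in which one category never occurs. The names and values are assumptions chosen only for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "remove-set-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical series in which the \"large\" category has no observations\n",
    "sizes = pd.Series(\n",
    "    pd.Categorical(\n",
    "        [\"small\", \"medium\", \"small\"], categories=[\"small\", \"medium\", \"large\"]\n",
    "    )\n",
    ")\n",
    "# Drop any category that never occurs (\"large\" here)\n",
    "print(sizes.cat.remove_unused_categories().cat.categories)\n",
    "# Remove a category that is in use: its values become NaN\n",
    "print(sizes.cat.remove_categories([\"medium\"]))\n",
    "# Set the full list of categories at once: \"medium\" is dropped, \"huge\" is added\n",
    "print(sizes.cat.set_categories([\"small\", \"large\", \"huge\"]))"
   ]
  },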
  {
   "cell_type": "markdown",
   "id": "76b38a2a",
   "metadata": {},
   "source": [
    "## Operations on Categories\n",
    "\n",
    "As noted, ordered categories will already undergo some operations. But there are some that work on any set of categories. Perhaps the most useful is `value_counts()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19f5bdda",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"cat_type\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a36db155",
   "metadata": {},
   "source": [
    "Note that even though 'delta' doesn't appear at all, it gets a count (of zero). This tracking of missing values can be quite handy.\n",
    "\n",
    "`mode()` is another one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f52d5d0d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"cat_type\"].mode()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d28f34d8",
   "metadata": {},
   "source": [
    "And if your categorical column happens to consist of *elements* that can undergo operations, those same operations will still work. For example,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d43e94d",
   "metadata": {},
   "outputs": [],
   "source": [
    "time_df = pd.DataFrame(\n",
    "    pd.Series(pd.date_range(\"2015/05/01\", periods=5, freq=\"M\"), dtype=\"category\"),\n",
    "    columns=[\"datetime\"],\n",
    ")\n",
    "time_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db697f86",
   "metadata": {},
   "outputs": [],
   "source": [
    "time_df[\"datetime\"].dt.month"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c202f0d1",
   "metadata": {},
   "source": [
    "Finally, if you ever need to translate your actual data types in your categorical column into a code, you can use `.cat.codes` to get unique codes for each value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13e7bc66",
   "metadata": {},
   "outputs": [],
   "source": [
    "time_df[\"datetime\"].cat.codes"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc"
  },
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
